Measurement Error in Lasso: Impact and Correction
Regression with the lasso penalty is a popular tool for performing dimension
reduction when the number of covariates is large. In many applications of the
lasso, such as in genomics, covariates are subject to measurement error. We study
the impact of measurement error on linear regression with the lasso penalty,
both analytically and in simulation experiments. A simple method of correction
for measurement error in the lasso is then considered. In the large sample
limit, the corrected lasso yields sign consistent covariate selection under
conditions very similar to the lasso with perfect measurements, whereas the
uncorrected lasso requires much more stringent conditions on the covariance
structure of the data. Finally, we suggest methods to correct for measurement
error in generalized linear models with the lasso penalty, which we study
empirically in simulation experiments with logistic regression, and also apply
to a classification problem with microarray data. We see that the corrected
lasso selects fewer false positives than the standard lasso, at a similar level
of true positives. The corrected lasso can therefore be used to obtain more
conservative covariate selection in genomic analysis.
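To make the comparison concrete, one common way to write such a correction is sketched below. The notation is an assumption and does not appear in the abstract: $W = X + U$ is the observed covariate matrix, $U$ is additive measurement error whose rows have known covariance $\Sigma_{uu}$, and $R$ is a hypothetical bound on the $\ell_1$ norm.

```latex
% Naive lasso: treats the error-prone W as if it were the true X
\hat{\beta}_{\text{naive}}
  = \arg\min_{\beta}\; \frac{1}{n}\,\lVert y - W\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1

% Corrected lasso (sketch): since E[W^T W / n] = X^T X / n + \Sigma_{uu},
% subtracting \beta^T \Sigma_{uu} \beta removes the bias in the quadratic term.
% The constraint \lVert\beta\rVert_1 \le R keeps the (possibly non-convex)
% problem bounded.
\hat{\beta}_{\text{corr}}
  = \arg\min_{\lVert \beta \rVert_1 \le R}\;
    \frac{1}{n}\,\lVert y - W\beta \rVert_2^2
    - \beta^{\top} \Sigma_{uu}\, \beta
    + \lambda \lVert \beta \rVert_1
```

The scaling of the squared-error term and the exact form of the constraint vary between presentations, so this should be read as an illustration of the idea rather than the paper's precise estimator.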
A genetic and spatial Bayesian analysis of mastitis resistance
A nationwide health card recording system for dairy cattle was introduced in Norway in 1975 (the Norwegian Cattle Health Services). The database holds information on mastitis occurrences on an individual cow basis. A reduction in mastitis frequency across the population is desired, and for this purpose risk factors are investigated. In this paper a Bayesian proportional hazards model is used for modelling the time to first veterinary treatment of clinical mastitis, including both genetic and environmental covariates. Sire effects were modelled as shared random components, and veterinary district was included as an environmental effect with prior spatial smoothing. A non-informative smoothing prior was assumed for the baseline hazard, and Markov chain Monte Carlo (MCMC) methods were used for inference. We propose a new measure of quality for sires, in terms of their posterior probability of being among, say, the 10% best sires. The probability is an easily interpretable measure that can be directly used to rank sires. Estimating these complex probabilities is straightforward in an MCMC setting. The results indicate considerable differences between sires with regard to their daughters' disease resistance. A regional effect was also discovered, with the lowest risk of disease in the south-eastern parts of Norway.
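The proposed sire measure can be estimated directly from posterior draws. The following is a minimal sketch, not the authors' code. It assumes that each MCMC iteration yields one effect per sire and that a lower effect means a lower hazard (a better sire). The function name `prob_among_best` and the simulated draws are hypothetical.

```python
import random


def prob_among_best(draws, fraction=0.10):
    """Estimate, for each sire, the posterior probability of being among
    the best `fraction` of sires.

    draws: list of MCMC iterations, each a list of sire effects,
           where a lower value is assumed to mean lower hazard (better).
    """
    n_sires = len(draws[0])
    k = max(1, round(fraction * n_sires))  # size of the "best" group
    counts = [0] * n_sires
    for sample in draws:
        # Indices of the k sires with the lowest effect in this iteration
        best = sorted(range(n_sires), key=lambda j: sample[j])[:k]
        for j in best:
            counts[j] += 1
    # Monte Carlo estimate: share of iterations in which each sire is in the top group
    return [c / len(draws) for c in counts]


# Toy illustration with simulated "posterior draws" (not real data)
random.seed(0)
centres = [random.gauss(0.0, 0.3) for _ in range(20)]
toy_draws = [[random.gauss(m, 0.1) for m in centres] for _ in range(500)]
probs = prob_among_best(toy_draws)
ranking = sorted(range(len(probs)), key=lambda j: -probs[j])
print("Sires ordered by probability of being in the best 10%:", ranking[:5])
```

Because the probability is computed per iteration from the joint draw, it accounts for posterior dependence between sires, which ranking by posterior means alone would ignore.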
Pair-copula constructions of multiple dependence
Building on the work of Bedford, Cooke and Joe, we show how multivariate data, which exhibit complex patterns of dependence in the tails, can be modelled using a cascade of pair-copulae, acting on two variables at a time. We use the pair-copula decomposition of a general multivariate distribution and propose a method to perform inference. The model construction is hierarchical in nature, the various levels corresponding to the incorporation of more variables in the conditioning sets, using pair-copulae as simple building blocks. Pair-copula decomposed models also represent a very flexible way to construct higher-dimensional copulae. We apply the methodology to a financial data set. Our approach represents the first step towards the development of an unsupervised algorithm that explores the space of possible pair-copula models and that can also be applied to huge data sets automatically.
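As an illustration of the decomposition, the three-variable case can be written as follows. This is one of several possible orderings, and the notation ($f_i$ and $F_i$ for marginal densities and distribution functions, $c_{\cdot}$ for pair-copula densities) is assumed here rather than taken from the abstract.

```latex
f(x_1, x_2, x_3)
  = f_1(x_1)\, f_2(x_2)\, f_3(x_3)
    % first level: unconditional pairs
    \cdot c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)
    \cdot c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)
    % second level: pair (1,3) conditional on variable 2
    \cdot c_{13|2}\bigl(F(x_1 \mid x_2),\, F(x_3 \mid x_2)\bigr)
```

Each additional level conditions on one more variable, which matches the hierarchical construction described above.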
Diverse personalized recommendations with uncertainty from implicit preference data with the Bayesian Mallows Model
Clicking data, which exists in abundance and contains objective user
preference information, is widely used to produce personalized recommendations
in web-based applications. Current popular recommendation algorithms, typically
based on matrix factorizations, often have high accuracy and achieve good
clickthrough rates. However, diversity of the recommended items, which can
greatly enhance user experiences, is often overlooked. Moreover, most
algorithms do not produce interpretable uncertainty quantifications of the
recommendations. In this work, we propose the Bayesian Mallows for Clicking
Data (BMCD) method, which augments clicking data into compatible full ranking
vectors by enforcing all the clicked items to be top-ranked. User preferences
are learned using a Mallows ranking model. Bayesian inference leads to
interpretable uncertainties of each individual recommendation, and we also
propose a method to make personalized recommendations based on such
uncertainties. With a simulation study and a real life data example, we
demonstrate that compared to state-of-the-art matrix factorization, BMCD makes
personalized recommendations with similar accuracy, while achieving a much higher
level of diversity, and producing interpretable and actionable uncertainty
estimation.
Comment: 27 pages
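The augmentation step, in which clicked items are forced into the top ranks, can be illustrated with a small sketch. It is a simplification under stated assumptions: it draws one compatible full ranking uniformly at random, whereas the method described above learns rankings through Bayesian inference. The function name `augment_clicks` and the integer item identifiers are hypothetical.

```python
import random


def augment_clicks(clicked, n_items, rng=random):
    """Return one full ranking (item -> rank, 1 = most preferred) that is
    compatible with the clicks: every clicked item outranks every
    unclicked item. Order within each group is drawn at random.
    """
    clicked = list(dict.fromkeys(clicked))  # drop duplicate clicks, keep order
    clicked_set = set(clicked)
    unclicked = [i for i in range(n_items) if i not in clicked_set]
    rng.shuffle(clicked)    # random order among clicked items
    rng.shuffle(unclicked)  # random order among the rest
    order = clicked + unclicked
    return {item: rank for rank, item in enumerate(order, start=1)}


# Toy usage: a user clicked items 3 and 7 out of 10
ranking = augment_clicks([3, 7], n_items=10)
print(ranking)
```

In a full sampler, a step like this would be repeated so that the uncertainty about the unobserved part of each ranking propagates into the recommendations.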
Unsupervised empirical Bayesian multiple testing with external covariates
In an empirical Bayesian setting, we provide a new multiple testing method,
useful when an additional covariate is available that influences the
probability of each null hypothesis being true. We measure the posterior
significance of each test conditionally on the covariate and the data, leading
to greater power. Using covariate-based prior information in an unsupervised
fashion, we produce a list of significant hypotheses which differs in length
and order from the list obtained by methods not taking covariate information
into account. Covariate-modulated posterior probabilities of each null
hypothesis are estimated using a fast approximate algorithm. The new method is
applied to expression quantitative trait loci (eQTL) data.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS158 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
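A covariate-modulated posterior probability of this kind can be sketched with a generic two-groups formulation. The notation is an assumption and may differ from the paper's: $z_i$ is the test statistic for hypothesis $i$, $x_i$ its covariate, $\pi_0(x_i)$ the prior probability that the null is true given the covariate, and $f_0$, $f_1$ the densities of $z_i$ under the null and the alternative.

```latex
% Posterior probability that null hypothesis i is true,
% given its test statistic z_i and covariate x_i
P(H_i = 0 \mid z_i, x_i)
  = \frac{\pi_0(x_i)\, f_0(z_i)}
         {\pi_0(x_i)\, f_0(z_i) + \bigl(1 - \pi_0(x_i)\bigr)\, f_1(z_i)}
```

Two tests with the same statistic $z_i$ can thus receive different posterior probabilities when their covariates differ, which is why the resulting list of significant hypotheses can change in both length and order.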
Indirect genomic effects on survival from gene expression data
A novel methodology is presented for detecting and quantifying indirect effects on cancer survival mediated through several target genes of transcription factors in cancer microarray data.
- …